Building your Recurrent Neural Network - Step by Step

Welcome to Course 5's first assignment! In this assignment, you will implement key components of a Recurrent Neural Network in numpy.

Recurrent Neural Networks (RNN) are very effective for Natural Language Processing and other sequence tasks because they have "memory". They can read inputs $x^{\langle t \rangle}$ (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a unidirectional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future.

Notation:

Example:

Pre-requisites

Be careful when modifying the starter code

Updates for 3b

If you were working on the notebook before this update...

List of updates

Let's first import all the packages that you will need during this assignment.

1 - Forward propagation for the basic Recurrent Neural Network

Later this week, you will generate music using an RNN. The basic RNN that you will implement has the structure below. In this example, $T_x = T_y$.

**Figure 1**: Basic RNN model

Dimensions of input $x$

Input with $n_x$ number of units

Time steps of size $T_{x}$

Batches of size $m$

3D Tensor of shape $(n_{x},m,T_{x})$

Taking a 2D slice for each time step: $x^{\langle t \rangle}$
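For concreteness, here is a small numpy illustration of slicing the 3D input tensor into the 2D input for a single time step (this is not part of the graded code; the sizes below are just example values):

```python
import numpy as np

n_x, m, T_x = 3, 10, 7            # example sizes, not the graded test values
x = np.random.randn(n_x, m, T_x)  # 3D input tensor of shape (n_x, m, T_x)

t = 2
xt = x[:, :, t]                   # 2D slice x<t> for time step t
print(xt.shape)                   # (3, 10), i.e. (n_x, m)
```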

Definition of hidden state $a$

Dimensions of hidden state $a$

Dimensions of prediction $\hat{y}$

Here's how you can implement an RNN:

Steps:

  1. Implement the calculations needed for one time-step of the RNN.
  2. Implement a loop over $T_x$ time-steps in order to process all the inputs, one at a time.

1.1 - RNN cell

A recurrent neural network can be seen as the repeated use of a single cell. You are first going to implement the computations for a single time-step. The following figure describes the operations for a single time-step of an RNN cell.

**Figure 2**: Basic RNN cell. Takes as input $x^{\langle t \rangle}$ (current input) and $a^{\langle t - 1\rangle}$ (previous hidden state containing information from the past), and outputs $a^{\langle t \rangle}$ which is given to the next RNN cell and also used to predict $\hat{y}^{\langle t \rangle}$

rnn cell versus rnn_cell_forward

Exercise: Implement the RNN-cell described in Figure (2).

Instructions:

  1. Compute the hidden state with tanh activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$.
  2. Using your new hidden state $a^{\langle t \rangle}$, compute the prediction $\hat{y}^{\langle t \rangle} = \textrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y)$. The function softmax is provided for you.
  3. Store $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$ in a cache.
  4. Return $a^{\langle t \rangle}$, $\hat{y}^{\langle t \rangle}$, and cache.

Additional Hints
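For reference, here is a minimal sketch of what rnn_cell_forward could look like, written directly from the equations above. The softmax helper shown here is an assumption standing in for the one provided with the assignment; treat this as an illustration, not the official solution.

```python
import numpy as np

def softmax(x):
    # Column-wise softmax (assumed helper; the assignment provides its own version)
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / np.sum(e_x, axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    # Retrieve the parameters
    Wax, Waa, Wya = parameters["Wax"], parameters["Waa"], parameters["Wya"]
    ba, by = parameters["ba"], parameters["by"]

    # Hidden state with tanh activation
    a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    # Prediction y_hat<t> via a dense layer followed by softmax
    yt_pred = softmax(np.dot(Wya, a_next) + by)

    # Store values needed for the backward pass
    cache = (a_next, a_prev, xt, parameters)
    return a_next, yt_pred, cache
```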

Expected Output:

a_next[4] = 
 [ 0.59584544  0.18141802  0.61311866  0.99808218  0.85016201  0.99980978
 -0.18887155  0.99815551  0.6531151   0.82872037]
a_next.shape = 
 (5, 10)
yt_pred[1] =
 [ 0.9888161   0.01682021  0.21140899  0.36817467  0.98988387  0.88945212
  0.36920224  0.9966312   0.9982559   0.17746526]
yt_pred.shape = 
 (2, 10)

1.2 - RNN forward pass

**Figure 3**: Basic RNN. The input sequence $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is carried over $T_x$ time steps. The network outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$.

Exercise: Code the forward propagation of the RNN described in Figure (3).

Instructions:

Additional Hints
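A possible skeleton for rnn_forward, assuming numpy is imported as np and rnn_cell_forward from the sketch above; the dimensions follow the notation in this notebook:

```python
def rnn_forward(x, a0, parameters):
    caches = []
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape

    # Initialize the outputs and the first hidden state
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    a_next = a0

    # Loop over the time steps, one 2D slice of x at a time
    for t in range(T_x):
        a_next, yt_pred, cache = rnn_cell_forward(x[:, :, t], a_next, parameters)
        a[:, :, t] = a_next
        y_pred[:, :, t] = yt_pred
        caches.append(cache)

    # Store values needed for backward propagation
    caches = (caches, x)
    return a, y_pred, caches
```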

Expected Output:

a[4][1] = 
 [-0.99999375  0.77911235 -0.99861469 -0.99833267]
a.shape = 
 (5, 10, 4)
y_pred[1][3] =
 [ 0.79560373  0.86224861  0.11118257  0.81515947]
y_pred.shape = 
 (2, 10, 4)
caches[1][1][3] =
 [-1.1425182  -0.34934272 -0.20889423  0.58662319]
len(caches) = 
 2

Congratulations! You've successfully built the forward propagation of a recurrent neural network from scratch.

Situations when this RNN will perform better:

In the next part, you will build a more complex LSTM model, which is better at addressing vanishing gradients. The LSTM will be better able to remember a piece of information and keep it saved for many timesteps.

2 - Long Short-Term Memory (LSTM) network

The following figure shows the operations of an LSTM-cell.

**Figure 4**: LSTM-cell. This tracks and updates a "cell state" or memory variable $c^{\langle t \rangle}$ at every time-step, which can be different from $a^{\langle t \rangle}$. Note that the $softmax^{*}$ includes a dense layer and a softmax.

Similar to the RNN example above, you will start by implementing the LSTM cell for a single time-step. Then you can iteratively call it from inside a "for-loop" to have it process an input with $T_x$ time-steps.

Overview of gates and states

- Forget gate $\mathbf{\Gamma}_{f}$

Equation
$$\mathbf{\Gamma}_f^{\langle t \rangle} = \sigma(\mathbf{W}_f[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_f)\tag{1} $$
Explanation of the equation:
Variable names in the code

The variable names in the code are similar to the equations, with slight differences.

- Candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$

Equation
$$\mathbf{\tilde{c}}^{\langle t \rangle} = \tanh\left( \mathbf{W}_{c} [\mathbf{a}^{\langle t - 1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{c} \right) \tag{3}$$
Explanation of the equation
Variable names in the code

- Update gate $\mathbf{\Gamma}_{i}$

Equation
$$\mathbf{\Gamma}_i^{\langle t \rangle} = \sigma(\mathbf{W}_i[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_i)\tag{2} $$

Explanation of the equation
Variable names in the code (please note that they differ slightly from the equations)

In the code, we'll use the variable names found in the academic literature. These variables don't use "u" to denote "update".

- Cell state $\mathbf{c}^{\langle t \rangle}$

Equation
$$ \mathbf{c}^{\langle t \rangle} = \mathbf{\Gamma}_f^{\langle t \rangle}* \mathbf{c}^{\langle t-1 \rangle} + \mathbf{\Gamma}_{i}^{\langle t \rangle} *\mathbf{\tilde{c}}^{\langle t \rangle} \tag{4} $$
Explanation of equation
Variable names and shapes in the code

- Output gate $\mathbf{\Gamma}_{o}$

Equation
$$ \mathbf{\Gamma}_o^{\langle t \rangle}= \sigma(\mathbf{W}_o[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{o})\tag{5}$$

Explanation of the equation
Variable names in the code

- Hidden state $\mathbf{a}^{\langle t \rangle}$

Equation
$$ \mathbf{a}^{\langle t \rangle} = \mathbf{\Gamma}_o^{\langle t \rangle} * \tanh(\mathbf{c}^{\langle t \rangle})\tag{6} $$
Explanation of equation
Variable names and shapes in the code

- Prediction $\mathbf{y}^{\langle t \rangle}_{pred}$

The equation is: $$\mathbf{y}^{\langle t \rangle}_{pred} = \textrm{softmax}(\mathbf{W}_{y} \mathbf{a}^{\langle t \rangle} + \mathbf{b}_{y})$$

Variable names and shapes in the code

2.1 - LSTM cell

Exercise: Implement the LSTM cell described in the Figure (4).

Instructions:

  1. Concatenate the hidden state $a^{\langle t-1 \rangle}$ and input $x^{\langle t \rangle}$ into a single matrix:
$$concat = \begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$$

  2. Compute all of the formulas 1 through 6 for the gates, hidden state, and cell state.
  3. Compute the prediction $y^{\langle t \rangle}$.

Additional Hints
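A minimal sketch of lstm_cell_forward following equations 1 through 6. The sigmoid helper below is an assumption standing in for the one provided with the assignment, and softmax is taken from the earlier sketch:

```python
def sigmoid(x):
    # Assumed helper; the assignment provides its own version
    return 1 / (1 + np.exp(-x))

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    # Retrieve the parameters
    Wf, bf = parameters["Wf"], parameters["bf"]
    Wi, bi = parameters["Wi"], parameters["bi"]
    Wc, bc = parameters["Wc"], parameters["bc"]
    Wo, bo = parameters["Wo"], parameters["bo"]
    Wy, by = parameters["Wy"], parameters["by"]

    # Concatenate a_prev and xt into a single matrix
    concat = np.concatenate((a_prev, xt), axis=0)

    # Equations 1-6
    ft = sigmoid(np.dot(Wf, concat) + bf)    # forget gate
    it = sigmoid(np.dot(Wi, concat) + bi)    # update gate
    cct = np.tanh(np.dot(Wc, concat) + bc)   # candidate value
    c_next = ft * c_prev + it * cct          # cell state
    ot = sigmoid(np.dot(Wo, concat) + bo)    # output gate
    a_next = ot * np.tanh(c_next)            # hidden state

    # Prediction via a dense layer followed by softmax
    yt_pred = softmax(np.dot(Wy, a_next) + by)

    # Store values needed for the backward pass
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)
    return a_next, c_next, yt_pred, cache
```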

Expected Output:

a_next[4] = 
 [-0.66408471  0.0036921   0.02088357  0.22834167 -0.85575339  0.00138482
  0.76566531  0.34631421 -0.00215674  0.43827275]
a_next.shape =  (5, 10)
c_next[2] = 
 [ 0.63267805  1.00570849  0.35504474  0.20690913 -1.64566718  0.11832942
  0.76449811 -0.0981561  -0.74348425 -0.26810932]
c_next.shape =  (5, 10)
yt[1] = [ 0.79913913  0.15986619  0.22412122  0.15606108  0.97057211  0.31146381
  0.00943007  0.12666353  0.39380172  0.07828381]
yt.shape =  (2, 10)
cache[1][3] =
 [-0.16263996  1.03729328  0.72938082 -0.54101719  0.02752074 -0.30821874
  0.07651101 -1.03752894  1.41219977 -0.37647422]
len(cache) =  10

2.2 - Forward pass for LSTM

Now that you have implemented one step of an LSTM, you can iterate it over $T_x$ time-steps using a for loop to process a sequence of inputs.

**Figure 5**: LSTM over multiple time-steps.

Exercise: Implement lstm_forward() to run an LSTM over $T_x$ time-steps.

Instructions
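One way the loop might be structured, assuming lstm_cell_forward from the sketch above and an initial cell state of zeros:

```python
def lstm_forward(x, a0, parameters):
    caches = []
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wy"].shape

    # Initialize the outputs
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))

    # Initialize the hidden state and the cell state
    a_next = a0
    c_next = np.zeros((n_a, m))

    # Loop over the time steps
    for t in range(T_x):
        a_next, c_next, yt, cache = lstm_cell_forward(x[:, :, t], a_next, c_next, parameters)
        a[:, :, t] = a_next
        c[:, :, t] = c_next
        y[:, :, t] = yt
        caches.append(cache)

    # Store values needed for backward propagation
    caches = (caches, x)
    return a, y, c, caches
```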

Expected Output:

a[4][3][6] =  0.172117767533
a.shape =  (5, 10, 7)
y[1][4][3] = 0.95087346185
y.shape =  (2, 10, 7)
caches[1][1][1] =
 [ 0.82797464  0.23009474  0.76201118 -0.22232814 -0.20075807  0.18656139
  0.41005165]
c[1][2][1] =  -0.855544916718
len(caches) =  2

Congratulations! You have now implemented the forward passes for the basic RNN and the LSTM. When using a deep learning framework, implementing the forward pass is sufficient to build systems that achieve great performance.

The rest of this notebook is optional, and will not be graded.

3 - Backpropagation in recurrent neural networks (OPTIONAL / UNGRADED)

In modern deep learning frameworks, you only have to implement the forward pass, and the framework takes care of the backward pass, so most deep learning engineers do not need to bother with the details of the backward pass. If however you are an expert in calculus and want to see the details of backprop in RNNs, you can work through this optional portion of the notebook.

In an earlier course, when you implemented a simple (fully connected) neural network, you used backpropagation to compute the derivatives of the cost with respect to the parameters in order to update them. Similarly, in recurrent neural networks you can calculate the derivatives of the cost with respect to the parameters in order to update them. The backprop equations are quite complicated, and we did not derive them in lecture; however, we will briefly present them below.

Note that this notebook does not implement the backward path from the loss $J$ back to $a$. That path would include the dense layer and softmax, which are part of the forward path. It is assumed to be calculated elsewhere, and the result is passed to rnn_backward in 'da'. It is further assumed that the loss has already been adjusted for the batch size (m), so division by the number of examples is not required here.

This section is optional and ungraded. It is more difficult and has fewer details regarding its implementation. This section only implements key elements of the full path.

3.1 - Basic RNN backward pass

We will start by computing the backward pass for the basic RNN-cell and then in the following sections, iterate through the cells.


**Figure 6**: RNN-cell's backward pass. Just like in a fully-connected neural network, the derivative of the cost function $J$ backpropagates through the time steps of the RNN by following the chain-rule from calculus. Internal to the cell, the chain-rule is also used to calculate $(\frac{\partial J}{\partial W_{ax}},\frac{\partial J}{\partial W_{aa}},\frac{\partial J}{\partial b})$ to update the parameters $(W_{ax}, W_{aa}, b_a)$. The operation can utilize the cached results from the forward path.

Recall from lecture, the shorthand for the partial derivative of cost relative to a variable is dVariable. For example, $\frac{\partial J}{\partial W_{ax}}$ is $dW_{ax}$. This will be used throughout the remaining sections.


**Figure 7**: This implementation of rnn_cell_backward does **not** include the output dense layer and softmax which are included in rnn_cell_forward. $da_{next}$ is $\frac{\partial{J}}{\partial a^{\langle t \rangle}}$ and includes loss from previous stages and current stage output logic. The addition shown in green will be part of your implementation of rnn_backward.
Equations

To compute the rnn_cell_backward you can utilize the following equations. It is a good exercise to derive them by hand. Here, $*$ denotes element-wise multiplication while the absence of a symbol indicates matrix multiplication.

\begin{align}
\displaystyle a^{\langle t \rangle} &= \tanh(W_{ax} x^{\langle t \rangle} + W_{aa} a^{\langle t-1 \rangle} + b_{a})\tag{-} \\[8pt]
\displaystyle \frac{\partial \tanh(x)} {\partial x} &= 1 - \tanh^2(x) \tag{-} \\[8pt]
\displaystyle dW_{ax} &= (da_{next} * ( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) )) x^{\langle t \rangle T}\tag{1} \\[8pt]
\displaystyle dW_{aa} &= (da_{next} * ( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) )) a^{\langle t-1 \rangle T}\tag{2} \\[8pt]
\displaystyle db_a &= \sum_{batch}( da_{next} * ( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) ))\tag{3} \\[8pt]
\displaystyle dx^{\langle t \rangle} &= { W_{ax}}^T (da_{next} * ( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) ))\tag{4} \\[8pt]
\displaystyle da_{prev} &= { W_{aa}}^T(da_{next} * ( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) ))\tag{5}
\end{align}

Implementing rnn_cell_backward

The results can be computed directly by implementing the equations above. However, they can optionally be simplified by computing 'dz' and utilizing the chain rule.
This can be further simplified by noting that $\tanh(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a})$ was already computed and saved as $a^{\langle t \rangle}$ in the forward pass.

To calculate dba, the 'batch' above is a sum across all 'm' examples (axis= 1). Note that you should use the keepdims = True option.

It may be worthwhile to review Course 1 Derivatives with a computational graph through Backpropagation Intuition, which decompose the calculation into steps using the chain rule.
Matrix vector derivatives are described here, though the equations above incorporate the required transformations.

Note that rnn_cell_backward does not include the calculation of loss from $y^{\langle t \rangle}$; this is incorporated into the incoming da_next. This is a slight mismatch with rnn_cell_forward, which includes a dense layer and softmax.

Note: In the code:

- $\displaystyle dx^{\langle t \rangle}$ is represented by dxt,
- $\displaystyle dW_{ax}$ is represented by dWax,
- $\displaystyle da_{prev}$ is represented by da_prev,
- $\displaystyle dW_{aa}$ is represented by dWaa,
- $\displaystyle db_{a}$ is represented by dba,
- dz is not derived above but can optionally be derived by students to simplify the repeated calculations.
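Putting the notes above together, one possible sketch of rnn_cell_backward (using the cached a_next instead of recomputing the tanh) is shown below; it assumes the cache layout from the rnn_cell_forward sketch earlier:

```python
def rnn_cell_backward(da_next, cache):
    (a_next, a_prev, xt, parameters) = cache
    Wax, Waa = parameters["Wax"], parameters["Waa"]

    # da_next * (1 - tanh^2(...)), where tanh(...) is the cached a_next
    dtanh = da_next * (1 - a_next ** 2)

    # Gradients with respect to inputs and parameters (equations 1-5)
    dxt = np.dot(Wax.T, dtanh)
    dWax = np.dot(dtanh, xt.T)
    da_prev = np.dot(Waa.T, dtanh)
    dWaa = np.dot(dtanh, a_prev.T)
    dba = np.sum(dtanh, axis=1, keepdims=True)

    gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}
    return gradients
```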

Expected Output:

**gradients["dxt"][1][2]** = -1.3872130506
**gradients["dxt"].shape** = (3, 10)
**gradients["da_prev"][2][3]** = -0.152399493774
**gradients["da_prev"].shape** = (5, 10)
**gradients["dWax"][3][1]** = 0.410772824935
**gradients["dWax"].shape** = (5, 3)
**gradients["dWaa"][1][2]** = 1.15034506685
**gradients["dWaa"].shape** = (5, 5)
**gradients["dba"][4]** = [ 0.20023491]
**gradients["dba"].shape** = (5, 1)

Backward pass through the RNN

Computing the gradients of the cost with respect to $a^{\langle t \rangle}$ at every time-step $t$ is useful because it is what helps the gradient backpropagate to the previous RNN-cell. To do so, you need to iterate through all the time steps starting at the end, and at each step, you increment the overall $db_a$, $dW_{aa}$, $dW_{ax}$ and you store $dx$.

Instructions:

Implement the rnn_backward function. First initialize the return variables with zeros, then loop through all the time steps, calling rnn_cell_backward at each time step and updating the other variables accordingly.
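A sketch of rnn_backward along these lines, assuming rnn_cell_backward from the sketch above and that da has shape (n_a, m, T_x):

```python
def rnn_backward(da, caches):
    (caches, x) = caches
    (a1, a0, x1, parameters) = caches[0]

    n_a, m, T_x = da.shape
    n_x, m = x1.shape

    # Initialize the gradients with zeros
    dx = np.zeros((n_x, m, T_x))
    dWax = np.zeros((n_a, n_x))
    dWaa = np.zeros((n_a, n_a))
    dba = np.zeros((n_a, 1))
    da0 = np.zeros((n_a, m))
    da_prevt = np.zeros((n_a, m))

    # Loop backward through the time steps
    for t in reversed(range(T_x)):
        # Gradient into the cell: loss at step t plus gradient flowing back from step t+1
        gradients = rnn_cell_backward(da[:, :, t] + da_prevt, caches[t])
        dx[:, :, t] = gradients["dxt"]   # stored, not accumulated
        da_prevt = gradients["da_prev"]
        dWax += gradients["dWax"]        # parameter gradients accumulate over time steps
        dWaa += gradients["dWaa"]
        dba += gradients["dba"]

    da0 = da_prevt
    gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa, "dba": dba}
    return gradients
```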

Expected Output:

**gradients["dx"][1][2]** = [-2.07101689 -0.59255627 0.02466855 0.01483317]
**gradients["dx"].shape** = (3, 10, 4)
**gradients["da0"][2][3]** = -0.314942375127
**gradients["da0"].shape** = (5, 10)
**gradients["dWax"][3][1]** = 11.2641044965
**gradients["dWax"].shape** = (5, 3)
**gradients["dWaa"][1][2]** = 2.30333312658
**gradients["dWaa"].shape** = (5, 5)
**gradients["dba"][4]** = [-0.74747722]
**gradients["dba"].shape** = (5, 1)

3.2 - LSTM backward pass

3.2.1 One Step backward

The LSTM backward pass is slightly more complicated than the forward pass.


**Figure 8**: lstm_cell_backward. Note that the output functions, while part of lstm_cell_forward, are not included in lstm_cell_backward

The equations for the LSTM backward pass are provided below. (If you enjoy calculus exercises feel free to try deriving these from scratch yourself.)

3.2.2 gate derivatives

Note the location of the gate derivatives ($\gamma$..) between the dense layer and the activation function (see graphic above). This is convenient for computing parameter derivatives in the next step.

\begin{align}
d\gamma_o^{\langle t \rangle} &= da_{next}*\tanh(c_{next}) * \Gamma_o^{\langle t \rangle}*\left(1-\Gamma_o^{\langle t \rangle}\right)\tag{7} \\[8pt]
dp\widetilde{c}^{\langle t \rangle} &= \left(dc_{next}*\Gamma_u^{\langle t \rangle}+ \Gamma_o^{\langle t \rangle}* (1-\tanh^2(c_{next})) * \Gamma_u^{\langle t \rangle} * da_{next} \right) * \left(1-\left(\widetilde c^{\langle t \rangle}\right)^2\right) \tag{8} \\[8pt]
d\gamma_u^{\langle t \rangle} &= \left(dc_{next}*\widetilde{c}^{\langle t \rangle} + \Gamma_o^{\langle t \rangle}* (1-\tanh^2(c_{next})) * \widetilde{c}^{\langle t \rangle} * da_{next}\right)*\Gamma_u^{\langle t \rangle}*\left(1-\Gamma_u^{\langle t \rangle}\right)\tag{9} \\[8pt]
d\gamma_f^{\langle t \rangle} &= \left(dc_{next}* c_{prev} + \Gamma_o^{\langle t \rangle} * (1-\tanh^2(c_{next})) * c_{prev} * da_{next}\right)*\Gamma_f^{\langle t \rangle}*\left(1-\Gamma_f^{\langle t \rangle}\right)\tag{10}
\end{align}

3.2.3 parameter derivatives

$$ dW_f = d\gamma_f^{\langle t \rangle} \begin{bmatrix} a_{prev} \\ x_t\end{bmatrix}^T \tag{11} $$
$$ dW_u = d\gamma_u^{\langle t \rangle} \begin{bmatrix} a_{prev} \\ x_t\end{bmatrix}^T \tag{12} $$
$$ dW_c = dp\widetilde c^{\langle t \rangle} \begin{bmatrix} a_{prev} \\ x_t\end{bmatrix}^T \tag{13} $$
$$ dW_o = d\gamma_o^{\langle t \rangle} \begin{bmatrix} a_{prev} \\ x_t\end{bmatrix}^T \tag{14} $$

To calculate $db_f, db_u, db_c, db_o$ you just need to sum across all 'm' examples (axis=1) on $d\gamma_f^{\langle t \rangle}, d\gamma_u^{\langle t \rangle}, dp\widetilde c^{\langle t \rangle}, d\gamma_o^{\langle t \rangle}$ respectively. Note that you should use the keepdims=True option.

$$\displaystyle db_f = \sum_{batch}d\gamma_f^{\langle t \rangle}\tag{15}$$
$$\displaystyle db_u = \sum_{batch}d\gamma_u^{\langle t \rangle}\tag{16}$$
$$\displaystyle db_c = \sum_{batch}dp\widetilde c^{\langle t \rangle}\tag{17}$$
$$\displaystyle db_o = \sum_{batch}d\gamma_o^{\langle t \rangle}\tag{18}$$

Finally, you will compute the derivative with respect to the previous hidden state, previous memory state, and input.

$ da_{prev} = W_f^T d\gamma_f^{\langle t \rangle} + W_u^T d\gamma_u^{\langle t \rangle}+ W_c^T dp\widetilde c^{\langle t \rangle} + W_o^T d\gamma_o^{\langle t \rangle} \tag{19}$

Here, to account for concatenation, the weights for equation 19 are the first $n_a$ columns (i.e. $W_f = W_f[:,:n_a]$, etc.).

$ dc_{prev} = dc_{next}*\Gamma_f^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1- \tanh^2(c_{next}))*\Gamma_f^{\langle t \rangle}*da_{next} \tag{20}$

$ dx^{\langle t \rangle} = W_f^T d\gamma_f^{\langle t \rangle} + W_u^T d\gamma_u^{\langle t \rangle}+ W_c^T dp\widetilde c^{\langle t \rangle} + W_o^T d\gamma_o^{\langle t \rangle}\tag{21} $

where the weights for equation 21 are the columns from $n_a$ to the end (i.e. $W_f = W_f[:,n_a:]$, etc.).

Exercise: Implement lstm_cell_backward by implementing equations $7-21$ above.

Note: In the code:

$d\gamma_o^{\langle t \rangle}$ is represented by dot,
$dp\widetilde{c}^{\langle t \rangle}$ is represented by dcct,
$d\gamma_u^{\langle t \rangle}$ is represented by dit,
$d\gamma_f^{\langle t \rangle}$ is represented by dft
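Combining equations 7 through 21, one possible sketch of lstm_cell_backward (using the cache layout from the lstm_cell_forward sketch above) is:

```python
def lstm_cell_backward(da_next, dc_next, cache):
    (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters) = cache
    n_x, m = xt.shape
    n_a, m = a_next.shape

    # Gate derivatives (equations 7-10)
    dot = da_next * np.tanh(c_next) * ot * (1 - ot)
    dcct = (dc_next * it + ot * (1 - np.tanh(c_next) ** 2) * it * da_next) * (1 - cct ** 2)
    dit = (dc_next * cct + ot * (1 - np.tanh(c_next) ** 2) * cct * da_next) * it * (1 - it)
    dft = (dc_next * c_prev + ot * (1 - np.tanh(c_next) ** 2) * c_prev * da_next) * ft * (1 - ft)

    # Parameter derivatives (equations 11-18)
    concat = np.concatenate((a_prev, xt), axis=0)
    dWf = np.dot(dft, concat.T)
    dWi = np.dot(dit, concat.T)
    dWc = np.dot(dcct, concat.T)
    dWo = np.dot(dot, concat.T)
    dbf = np.sum(dft, axis=1, keepdims=True)
    dbi = np.sum(dit, axis=1, keepdims=True)
    dbc = np.sum(dcct, axis=1, keepdims=True)
    dbo = np.sum(dot, axis=1, keepdims=True)

    # Derivatives w.r.t. previous hidden state, previous cell state, and input (equations 19-21)
    Wf, Wi, Wc, Wo = parameters["Wf"], parameters["Wi"], parameters["Wc"], parameters["Wo"]
    da_prev = (np.dot(Wf[:, :n_a].T, dft) + np.dot(Wi[:, :n_a].T, dit)
               + np.dot(Wc[:, :n_a].T, dcct) + np.dot(Wo[:, :n_a].T, dot))
    dc_prev = dc_next * ft + ot * (1 - np.tanh(c_next) ** 2) * ft * da_next
    dxt = (np.dot(Wf[:, n_a:].T, dft) + np.dot(Wi[:, n_a:].T, dit)
           + np.dot(Wc[:, n_a:].T, dcct) + np.dot(Wo[:, n_a:].T, dot))

    gradients = {"dxt": dxt, "da_prev": da_prev, "dc_prev": dc_prev,
                 "dWf": dWf, "dbf": dbf, "dWi": dWi, "dbi": dbi,
                 "dWc": dWc, "dbc": dbc, "dWo": dWo, "dbo": dbo}
    return gradients
```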

Expected Output:

**gradients["dxt"][1][2]** = 3.23055911511
**gradients["dxt"].shape** = (3, 10)
**gradients["da_prev"][2][3]** = -0.0639621419711
**gradients["da_prev"].shape** = (5, 10)
**gradients["dc_prev"][2][3]** = 0.797522038797
**gradients["dc_prev"].shape** = (5, 10)
**gradients["dWf"][3][1]** = -0.147954838164
**gradients["dWf"].shape** = (5, 8)
**gradients["dWi"][1][2]** = 1.05749805523
**gradients["dWi"].shape** = (5, 8)
**gradients["dWc"][3][1]** = 2.30456216369
**gradients["dWc"].shape** = (5, 8)
**gradients["dWo"][1][2]** = 0.331311595289
**gradients["dWo"].shape** = (5, 8)
**gradients["dbf"][4]** = [ 0.18864637]
**gradients["dbf"].shape** = (5, 1)
**gradients["dbi"][4]** = [-0.40142491]
**gradients["dbi"].shape** = (5, 1)
**gradients["dbc"][4]** = [ 0.25587763]
**gradients["dbc"].shape** = (5, 1)
**gradients["dbo"][4]** = [ 0.13893342]
**gradients["dbo"].shape** = (5, 1)

3.3 Backward pass through the LSTM RNN

This part is very similar to the rnn_backward function you implemented above. You will first create variables of the same dimension as your return variables. You will then iterate over all the time steps starting from the end and call the one step function you implemented for LSTM at each iteration. You will then update the parameters by summing them individually. Finally return a dictionary with the new gradients.

Instructions: Implement the lstm_backward function. Create a for loop starting from $T_x$ and going backward. For each step, call lstm_cell_backward and update your old gradients by adding the new gradients to them. Note that dxt is not updated but is stored.
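A sketch of lstm_backward along these lines, assuming lstm_cell_backward from the sketch above:

```python
def lstm_backward(da, caches):
    (caches, x) = caches
    (a1, c1, a0, c0, f1, i1, cc1, o1, x1, parameters) = caches[0]

    n_a, m, T_x = da.shape
    n_x, m = x1.shape

    # Initialize the gradients with zeros
    dx = np.zeros((n_x, m, T_x))
    da0 = np.zeros((n_a, m))
    da_prevt = np.zeros((n_a, m))
    dc_prevt = np.zeros((n_a, m))
    dWf = np.zeros((n_a, n_a + n_x))
    dWi = np.zeros((n_a, n_a + n_x))
    dWc = np.zeros((n_a, n_a + n_x))
    dWo = np.zeros((n_a, n_a + n_x))
    dbf = np.zeros((n_a, 1))
    dbi = np.zeros((n_a, 1))
    dbc = np.zeros((n_a, 1))
    dbo = np.zeros((n_a, 1))

    # Loop backward through the time steps
    for t in reversed(range(T_x)):
        gradients = lstm_cell_backward(da[:, :, t] + da_prevt, dc_prevt, caches[t])
        dx[:, :, t] = gradients["dxt"]   # stored, not accumulated
        da_prevt = gradients["da_prev"]
        dc_prevt = gradients["dc_prev"]
        dWf += gradients["dWf"]          # parameter gradients accumulate over time steps
        dWi += gradients["dWi"]
        dWc += gradients["dWc"]
        dWo += gradients["dWo"]
        dbf += gradients["dbf"]
        dbi += gradients["dbi"]
        dbc += gradients["dbc"]
        dbo += gradients["dbo"]

    da0 = da_prevt
    gradients = {"dx": dx, "da0": da0,
                 "dWf": dWf, "dbf": dbf, "dWi": dWi, "dbi": dbi,
                 "dWc": dWc, "dbc": dbc, "dWo": dWo, "dbo": dbo}
    return gradients
```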

Expected Output:

**gradients["dx"][1][2]** = [0.00218254 0.28205375 -0.48292508 -0.43281115]
**gradients["dx"].shape** = (3, 10, 4)
**gradients["da0"][2][3]** = 0.312770310257
**gradients["da0"].shape** = (5, 10)
**gradients["dWf"][3][1]** = -0.0809802310938
**gradients["dWf"].shape** = (5, 8)
**gradients["dWi"][1][2]** = 0.40512433093
**gradients["dWi"].shape** = (5, 8)
**gradients["dWc"][3][1]** = -0.0793746735512
**gradients["dWc"].shape** = (5, 8)
**gradients["dWo"][1][2]** = 0.038948775763
**gradients["dWo"].shape** = (5, 8)
**gradients["dbf"][4]** = [-0.15745657]
**gradients["dbf"].shape** = (5, 1)
**gradients["dbi"][4]** = [-0.50848333]
**gradients["dbi"].shape** = (5, 1)
**gradients["dbc"][4]** = [-0.42510818]
**gradients["dbc"].shape** = (5, 1)
**gradients["dbo"][4]** = [ -0.17958196]
**gradients["dbo"].shape** = (5, 1)

Congratulations!

Congratulations on completing this assignment. You now understand how recurrent neural networks work!

Let's go on to the next exercise, where you'll use an RNN to build a character-level language model.